Abstract

This paper develops a parallelizable multilevel multiple constrained nonlinear
equation solver. The substructuring process is automated to yield appropriately
balanced partitioning of each succeeding level. Due to the generality of the
procedure, sequential, partially parallel and fully parallel environments can all
be handled. This includes both single and multiple processor assignment per
individual partition. Several benchmark examples are presented. These illustrate
the robustness of the procedure as well as its capability to yield significant
reductions in memory utilization and in the calculational effort due to both
updating and inversion.
I. Introduction


For nonlinear static and implicit transient finite element or finite difference
simulations, some form of Newton Raphson [1-3] iterative algorithm is typically
employed to solve the resulting nonlinear set of equations. Prior to the 1980's,
the classical version of the scheme [1-3] was generally the preferred method. With
the advent of constrained adaptations, problems exhibiting postbuckling behavior
can now be handled routinely. To date a wide variety of constraint procedures have
been developed, for instance

i) Arc length control [4]

ii) Linear [5], circular [6,7], elliptic [8,9] and piecewise continuous [10]
constraints as well as

iii) The enforcement of bounds on successive stress, strain and energy [8,10,11]
excursions.

More recently, modifications have been introduced to partition and parallelize
the scheme. These include a variety of different approaches, i.e.

i) The least square method [12]

ii) The mixed direct-iterative solution of the stiffness matrix [13]

iii) The use of multiple/multilevel constraints [14,15] and

iv) The use of progressive substructuring [16].

Overall the foregoing [4-16] procedures have greatly widened our ability to tackle
highly nonlinear response problems including the interaction of contact, large
deformation kinematics, complex material response, postbuckling, etc.

The thrust to parallelize the Newton Raphson family of algorithms has risen
out of the need to handle ever increasing problem sizes. The heart of the difficulty
lies in the storage and inversion of the tangent matrix. While attempts to use
progressive substructuring [13,16,17] point to potentially significant gains, overall
the approach typically yields hit or miss improvements [18-20]. This follows from the
fact that generally no attempts have been made to balance memory and computational
efforts. As a result, progressive substructuring can lead to an imbalance between
internal and external variables, yielding potentially increased costs in

a) communications 

b) memory and 

c) computational effort. 

In the context of the foregoing, this paper will develop a parallelizable
multilevel constrained nonlinear equation solver. To provide for a proper balance
between the internal and external variables associated with successive levels, the
hierarchical poly tree (HPT) of Padovan and Gute [18-20] will be employed. This will
enable the minimization of computational effort, communications and memory
requirements. Concomitantly, the multilevel substructuring process will be
automated to yield the appropriate partitioning of each succeeding level.

Due to the generality of the procedure, sequential, partially parallel and fully
parallel environments can all be handled. Two approaches to parallelism will be
undertaken, i.e.

1) Where single processor assignment is defined for each partition and 

2) Where multiple processor assignment is employed. 

To further extend the range and capability of the scheme

i) A variety of local partition level update schemes will be explored, namely
BFGS [21], full and modified [1-3]

ii) Local and global convergence criteria will be developed and

iii) A variety of constraint procedures will be explored, i.e. global, individual
partition level, etc.

In the sections which follow, detailed discussions will be given on problem 
formulation, hierarchically substructured constrained solvers and automated sub- 
structuring. The results will be benchmarked in a series of numerical examples. 
II. Problem Formulation: Assessment of Current Capabilities


For structural systems composed of nonlinear media undergoing potentially
large deformations/strain excursions, the governing equations of motion take the
form

$$\frac{\partial}{\partial X_j}\left[ S_{jk}\left( \delta_{ik} + \frac{\partial U_i}{\partial X_k} \right) \right] + \rho f_i = \rho\,\frac{\partial^2 U_i}{\partial t^2} \tag{1}$$

where $S_{jk}$, $U_i$, $f_i$, $\rho$, $\delta_{ik}$, $X_j$ and $t$ respectively represent the 2nd
Piola-Kirchhoff stress tensor, the displacement vector, body force, density,
Kronecker delta, Lagrangian coordinates and time. The boundary conditions
associated with (1) are given by [22]
i) for $\forall x \in \partial V_S$:

$$S_{jk}\left( \delta_{ik} + \frac{\partial U_i}{\partial X_k} \right) n_j = S_i^* \tag{2}$$

ii) for $\forall x \in \partial V_U$:

$$U_i = U_i^* \tag{3}$$

where $n_j$, $S_i^*$ and $U_i^*$ respectively define the surface normal, prescribed surface
traction (on $\partial V_S$) and displacement (on $\partial V_U$). Given the use of the 2nd
Piola-Kirchhoff stress measure, the stress-strain relation will be cast in the form [22]


$$S_{ij} = S_{ij}(L_{11}, L_{12}, \ldots) \tag{4}$$

where $L_{ij}$ are components of the Green-Lagrange strain tensor, i.e.

$$L_{ij} = \frac{1}{2}\left\{ U_{i,j} + U_{j,i} + U_{l,i}\,U_{l,j} \right\} \tag{5}$$

Assuming a displacement type formulation, it follows that

$$U = [N]\,Y \tag{6}$$

$$\frac{d^2}{dt^2}(U) = [N]\,\ddot{Y} \tag{7}$$

where $[N]$, $Y$ and $\ddot{Y}$ are the shape function, nodal displacement and nodal
acceleration. Based on the use of the virtual work principle, (1-7) yield the
following FE formulation, namely [1-3]

$$\int_V [B^*]^T S\,dv + [M]\,\frac{d^2}{dt^2}(Y) = R \tag{8}$$

such that $S$, $[M]$, $R$ are respectively the 2nd Piola-Kirchhoff tensor cast in vector
form, the lumped/consistent mass matrix, the nodal force vector and [1-3]

$$[B^*] = [B] + [B_N(Y)] \tag{9}$$

Since (8) is highly nonlinear, both its static and implicit transient solution
require introduction of the tangent stiffness formulation, namely

$$\int_V [B^*]^T S\,dv\,\Big|_{Y+\Delta Y} \approx \int_V [B^*]^T S\,dv\,\Big|_{Y} + [K_T]\,\Delta Y \tag{10}$$

where [1-3]

$$[K_T] = \int_V \left\{ [G]^T [S] [G] + [B^*]^T [D_T] [B^*] \right\} dv \tag{11}$$

such that $[S]$ and $[D_T]$ are the prestress matrix and material tangent stiffness.
Based on (10), (8) reduces to the form

$$[M]\,\frac{d^2}{dt^2}(Y) + [K_T]\,\Delta Y = R - \int_V [B^*]^T S\,dv\,\Big|_{Y} \tag{12}$$

Employing various of the implicit solvers [23], (12) can be recast to yield the
expression

$$[K_D]\,\Delta Y = R_D - \int_V [B^*]^T S\,dv\,\Big|_{Y} \tag{13}$$

such that $[K_D]$ and $R_D$ are respectively the dynamic stiffness and load, i.e.
functions of the time step $\Delta t$. For static problems, (13) reduces to

$$[K_T]\,\Delta Y = R - \int_V [B^*]^T S\,dv\,\Big|_{Y} \tag{14}$$
Equations (13) and (14) define the core Newton Raphson relation. Cast in
algorithmic form, we obtain the expression

$$[K_D(Y^j_{i-1})]\,\Delta Y^j_i = R^j_D - \int_V [B^*]^T S\,dv\,\Big|_{Y^j_{i-1}} \tag{15}$$

such that $j$ defines the load/time step increment and $i$ the iteration count for the
given increment.

To control/constrain successive iterate excursions, typically a one parameter
bound is introduced to scale $R_D$, namely [4-10]

$$\Delta Y^j_i = [K_D(Y^j_{i-1})]^{-1}\left\{ R^{j-1}_D + \lambda^j_i\,\Delta R^j_D - \int_V [B^*]^T S\,dv\,\Big|_{Y^j_{i-1}} \right\} \tag{16}$$

where $\lambda^j_i$ is defined by a constraint condition. Considering the case of an
elliptic constraint surface [8], Fig. 2.1, $\lambda^j_i$ is chosen to satisfy the relation

$$\beta\,\|\Delta Y^j_i\|^2 + (\lambda^j_i)^2\,\|\Delta R^j_D\|^2 = \|\Delta R^j_D\|^2 \tag{17}$$

such that $\beta$ is the aspect ratio of the ellipse. The parameter $\beta$ can be selected
by various of the following criteria [10], i.e.:

1) By defining an allowable excursion for $\|\Delta Y^j_i\|$, i.e. a global condition;

2) By restricting the allowable excursion of a given node (a local condition) or

3) By restricting the allowable excursions in stress/strain or energy, either
globally or locally.
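As a minimal sketch of how the scaling parameter might be extracted, assume the constrained increment is written as $\Delta Y = b + \lambda a$, with $a = [K_D]^{-1}\Delta R_D$ and $b = [K_D]^{-1}$ times the remaining out-of-balance load, and assume the elliptic constraint reads $\beta\|\Delta Y\|^2 + \lambda^2\|\Delta R_D\|^2 = \|\Delta R_D\|^2$. The constraint then reduces to a quadratic in $\lambda$. The function and argument names below are hypothetical, and the choice of the larger root is an assumption rather than a rule taken from the text.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def solve_lambda(beta, a, b, dr_norm):
    """Solve beta*|b + lam*a|^2 + lam^2*|dR|^2 = |dR|^2 for lam.

    a       : K^-1 * dR      (hypothetical name, load-direction displacement)
    b       : K^-1 * residual (hypothetical name, out-of-balance displacement)
    dr_norm : Euclidean norm of the load increment dR
    Returns the larger real root, or None when the ellipse is not reached.
    """
    r2 = dr_norm ** 2
    qa = beta * dot(a, a) + r2
    qb = 2.0 * beta * dot(a, b)
    qc = beta * dot(b, b) - r2
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0.0:
        return None
    return (-qb + math.sqrt(disc)) / (2.0 * qa)


def constraint_residual(beta, a, b, dr_norm, lam):
    """Left side minus right side of the assumed elliptic constraint."""
    d_y = [bi + lam * ai for ai, bi in zip(a, b)]
    return beta * dot(d_y, d_y) + (lam * dr_norm) ** 2 - dr_norm ** 2


lam = solve_lambda(0.5, [1.0, 0.5], [0.1, -0.2], 2.0)
```

With $\beta = 0$ and no residual the sketch returns $\lambda = 1$, i.e. the full load increment is admitted.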

The algorithm (16) can be updated in three different formats:

1) Fully - $[K_D(Y^j_{i-1})]$ is continuously updated and inverted;

2) Modified - $[K_D(Y^j_0)]$ is intermittently updated and inverted, typically at
the beginning of each new load step; or

3) BFGS - $[K_D(Y^j_0)]$ is updated by pre and post multiplication via appropriate
rescaling metrics [10,21].

These update strategies pose the main shortcoming of the Newton Raphson scheme.
In particular, the updating and inversion of $[K_D(Y^j_i)]$ require significant storage
and computational effort.
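The trade-off between the full and modified formats can be illustrated on a single degree of freedom. The sketch below assumes a hardening spring with internal force $y + y^3$ (an illustrative model, not one of the benchmark problems) and counts the iterations needed when the tangent is re-formed every iteration versus once per load step.

```python
def newton(load, full_update, tol=1e-12, max_iter=200):
    """Solve y + y**3 = load by Newton Raphson iteration.

    full_update=True  : tangent re-formed at every iterate (full format)
    full_update=False : tangent formed once at the start (modified format)
    Returns (solution, number of iterations performed).
    """
    y = 0.0
    k = 1.0 + 3.0 * y * y                 # tangent stiffness at step start
    for it in range(max_iter + 1):
        residual = load - (y + y ** 3)    # out-of-balance load
        if abs(residual) < tol:
            return y, it
        if full_update:
            k = 1.0 + 3.0 * y * y         # update and "invert" the tangent
        y += residual / k
    raise RuntimeError("no convergence within max_iter")


y_full, n_full = newton(0.2, full_update=True)
y_mod, n_mod = newton(0.2, full_update=False)
```

For this small load both formats converge to the same root; the modified format simply trades extra iterations for the avoided updates. For larger loads the frozen tangent of the modified format may fail to converge at all, which is one reason the choice of update format matters.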

Considering 2-D and 3-D square and cubic regions with N nodes on an edge, it
follows that asymptotically the following work load ($\omega$) and memory ($\mu$)
requirements are defined, namely:

i) 2-D: Fig 2.2

$$\omega_{2D} \sim O(N^4), \qquad \mu_{2D} \sim O(N^3) \tag{18}$$

ii) 3-D: Fig 2.2

$$\omega_{3D} \sim O(N^7), \qquad \mu_{3D} \sim O(N^5) \tag{19}$$

These trends point to rapidly escalating costs as problem size/complexity grows.
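These exponents are what one would expect from a banded direct solver, where the work scales as (unknowns) x (bandwidth)^2 and the storage as (unknowns) x (bandwidth). The sketch below assumes that model, with all constants omitted; the function name is hypothetical.

```python
def banded_cost(n_edge, dim, dof_per_node=1):
    """Leading-order work and storage of a banded direct solver on a
    regular grid with n_edge nodes per edge (constants omitted)."""
    unknowns = dof_per_node * n_edge ** dim
    bandwidth = dof_per_node * n_edge ** (dim - 1)
    work = unknowns * bandwidth ** 2      # factorization ~ n * b^2
    memory = unknowns * bandwidth         # band storage  ~ n * b
    return work, memory


# Doubling N multiplies the 2-D work by 2^4 and the 3-D work by 2^7.
w2_a, m2_a = banded_cost(10, 2)
w2_b, m2_b = banded_cost(20, 2)
w3_a, m3_a = banded_cost(10, 3)
w3_b, m3_b = banded_cost(20, 3)
```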

Substructuring can potentially reduce the calculation burden. For the classic
two level case, considering the foregoing 2-D square region, it follows that for
partitioning into $k^2$ square regions, the net effort is defined by a two tiered
expression, i.e.

$$\omega_{2D} = (\omega_{2D})_1 + (\omega_{2D})_2 \tag{20}$$

where $(\omega_{2D})_1$ and $(\omega_{2D})_2$ are respectively the work loads associated with the
root and branch levels. Recalling the work of Padovan, Gute and Johnson [18], it
follows that the operations count can be defined by tracing the number of row and
column multiplications and additions. Note here an operation will be defined as a
multiplication-addition pair. Several forms of operational control are possible for
the two tiered format, in particular:

1) All substructures are handled sequentially by a single processor, Fig. 2.3

2) For a P processor machine

• Each of the P processors can be individually reattached to separate new
partitions upon completion of work in prior assignments, Fig. 2.3 or

• Sets of processors may be continuously reattached to partitions: single
assignment at a branch and multiple at the root, Fig. 2.3, or multiple
assignment at both branch and root, Fig. 2.4.
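The single assignment mode, in which each processor is reattached to the next waiting partition as soon as it is free, can be sketched as a simple list scheduler. The routine below is an illustration under that assumption only; it ignores communication costs and the root level, and its name is hypothetical.

```python
import heapq


def makespan(partition_efforts, n_proc):
    """Elapsed effort when each free processor takes the next partition.

    partition_efforts : effort of each branch partition, in processing order
    n_proc            : number of processors P
    """
    finish = [0.0] * n_proc               # time at which each processor frees up
    heapq.heapify(finish)
    for effort in partition_efforts:
        t = heapq.heappop(finish)         # earliest available processor
        heapq.heappush(finish, t + effort)
    return max(finish)


sequential_time = makespan([1.0] * 16, 1)    # one processor, 16 partitions
subparallel_time = makespan([1.0] * 16, 4)   # P < k^2
isoparallel_time = makespan([1.0] * 16, 16)  # P = k^2
```

With uniform partitions the elapsed effort drops in proportion to P until P reaches the number of partitions; a single oversized partition bounds the elapsed effort from below regardless of P.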

Performing the requisite operations count, asymptotically the work load at the 
root and individual 2nd level partitions is given by the relations 


3 (N) 3 / 9/c — 24 \ o 

:® 4« ( k-, K 


(21) 

„ 9 3 /W)V* 

^ . 

(22) 


such that $E_r$, $E_p$, $k$ and $\zeta$ respectively denote the root effort, 2nd level
partition effort, the number of partitions on an edge and the number of degrees of
freedom per node. Note the net global work load is given by


3 /kt \ 4 


E G =-(&m 


(23) 


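Employing (23) to normalize the effort counts of the sequential and (partially/fully)
parallel cases, we obtain the following relationships:

i) Purely sequential (single processor machine)

$$\left(\frac{\omega_{2D}}{E_G}\right)_{S} = \left(\frac{E_r}{E_G}\right) + k^2\left(\frac{E_p}{E_G}\right) = \frac{k}{4N}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^2} \tag{24}$$

ii) Partially (sub) parallel (P processor machine, single assignment, $(k)^2 > P$)

$$\left(\frac{\omega_{2D}}{E_G}\right)_{PS} = \left(\frac{E_r}{E_G}\right) + \left(\frac{E_p}{E_G}\right)\frac{k^2}{P} = \frac{k}{4N}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^2 P} \tag{25}$$

iii) Partially (sub) parallel (P processor machine, multiple assignment)

$$\left(\frac{\omega_{2D}}{E_G}\right)_{P} = \left(\frac{E_r}{E_G}\right)\frac{1}{P} + \left(\frac{E_p}{E_G}\right)\frac{k^2}{P} = \frac{k}{4NP}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^2 P} \tag{26}$$

iv) Fully (iso) parallel ($k^2$ processor machine, multiple processor assignment)

$$\left(\frac{\omega_{2D}}{E_G}\right)_{P} = \left(\frac{E_r}{E_G}\right)\frac{1}{k^2} + \left(\frac{E_p}{E_G}\right) = \frac{1}{4Nk}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^4} \tag{27}$$

v) Super parallel ($P > k^2$ processor machine, multiple processor assignment)

$$\left(\frac{\omega_{2D}}{E_G}\right)_{P} = \left(\frac{E_r}{E_G}\right)\frac{1}{P} + \left(\frac{E_p}{E_G}\right)\frac{k^2}{P} \tag{28}$$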

Based on the trends defined by (24)-(28), it follows that significant
improvements are possible for both the sequential and sub/iso/super parallel
arrangements. This must be tempered by the fact that in the case of multiple
processor assignment, significant losses in efficiency occur, a direct result of
Amdahl's law. Noting Table 5.6, which illustrates the effects of the number of
processors on overall efficiency, we see that significant reductions are recorded
as (P,N) are increased. In this context, the P in multiple assignment areas must be
replaced by

$$P_e = P_e(P,N) \tag{29}$$

such that $P_e$ denotes the effective number of processors.
All this points to the fact that great care must be taken in balancing the
number of substructures as well as their external to internal variable ratios. For
instance, noting the subparallel work effort ratio defined by (26), it follows that
replacing P by $P_e$, i.e. (29), we obtain the expression

$$\left(\frac{\omega_{2D}}{E_G}\right)_{P} = \frac{1}{P_e}\left\{ \frac{k}{4N}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^2} \right\} \tag{30}$$

Given realistic machine configurations, $P_e$ generally reaches a saturation value,
i.e. $P_s$. In this context (30) reduces to the form

$$\left(\frac{\omega_{2D}}{E_G}\right)_{P} = \frac{1}{P_s}\left\{ \frac{k}{4N}\left(\frac{9k-24}{k-1}\right) + \frac{9}{(k)^2} \right\} \tag{31}$$

Here we see that the parameters controlling improvement are k and N. For a given N,
(31) illustrates that there is a critical value of k yielding peak work load
reduction, i.e.

$$k_{critical} \sim \sqrt[3]{8N} \tag{32}$$

which follows from setting the derivative of (31) with respect to k to zero for
$k \gg 1$, where $(9k-24)/(k-1) \to 9$. This points to the fact that for a given
machine configuration, the proper tuning of the substructuring process strongly
influences speedup potential.
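The existence of a critical k can be checked numerically. The sketch below assumes the normalized ratios $E_r/E_G = (k/4N)(9k-24)/(k-1)$ and $E_p/E_G = 9/k^4$ implied by (27), and searches the integer k that minimizes the sequential effort; since the saturated multiple assignment form only divides by a constant, the minimizer is the same. Function names are hypothetical.

```python
def root_ratio(k, n):
    """Normalized root effort E_r / E_G for k partitions per edge."""
    return (k / (4.0 * n)) * (9.0 * k - 24.0) / (k - 1.0)


def partition_ratio(k):
    """Normalized effort E_p / E_G of one branch partition."""
    return 9.0 / k ** 4


def sequential(k, n):
    """Root plus k^2 partitions handled one after another."""
    return root_ratio(k, n) + k ** 2 * partition_ratio(k)


def multiple_assignment(k, n, p_eff):
    """Both levels shared among p_eff effective processors."""
    return sequential(k, n) / p_eff


def best_k(n):
    """Integer k >= 3 minimizing the normalized effort for a given N."""
    return min(range(3, n), key=lambda k: sequential(k, n))


k_star = best_k(1000)
k_estimate = (8 * 1000) ** (1.0 / 3.0)
```

For N = 1000 the search lands on k = 20, in line with the cube root estimate, and the normalized effort there is roughly 6% of the unpartitioned global effort.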

Similar trends apply to memory usage. Note for the 2-D $(k)^2$ square region, the
net memory requirement is defined by the expression

$$\mu_{2D} = (\mu_{2D})_1 + (\mu_{2D})_2 \tag{33}$$

where $(\mu_{2D})_1$ and $(\mu_{2D})_2$ are respectively the memory requirements associated
with the root and branch levels. Performing the requisite count, the root and
individual partition level memory requirements are defined by the expressions


(p^), - - (0 2 [6 (k) 3 _ 80c) 2 - 2k + 



(34) 
(35)


(P 9n ), 




2D 2 


Based on (33-35), we again see that the optimal memory reduction is controlled by 
the (N,k) pair. To achieve further gains, one must move to multilevel substructur- 
ing. Additionally, procedures need to be established which enable the proper tuning/ 
balancing of the substructuring for general shapes. 

Beyond the inversion problem, difficulties also arise out of the need to update.
For nonlinear problems, the update process is generally on the same order of
magnitude as the inverse problem. In this context the net computational burden is
given by

$$\omega_{net} = (\omega_{2D})_{inverse} + (\omega_{2D})_{update} \tag{36}$$

where typically

$$(\omega_{2D})_{update} \sim O\big((\omega_{2D})_{inverse}\big) \approx \tau\,(\omega_{2D})_{inverse} \tag{37}$$

such that generally $\tau \in (0, 1)$. In this context, it follows that to parallelize
the net effort, assuming partial parallelism

$$(\omega_{2D})_{update} \sim \frac{1}{P}\sum_{\ell=1}^{NE} E^u_\ell \tag{38}$$

wherein NE and $E^u_\ell$ are the number of elements and the update effort for the
$\ell$th element. If the element level data is cloned off into partitioned sets, then
the level of contention between processors is reduced, especially for systems with
some localized memory. Such characteristics tend to drive the architecture to
large P, i.e. a large number of processors. Since the two level tree is k limited, a
multilevel tree (MLT) would be required to enable a balanced growth of the number
of separate top level partitions. Furthermore, there would have to be a balancing
of multiprocessor reassignments due to system saturation.
III. Hierarchically Scaled and Partitioned Constrained Solvers

As noted earlier, to provide a proper balance between externals and internals,
depending on problem size and connectivity, several levels of substructuring may be
necessary. Referring to Fig. 3.1, a multilevel tree (MLT) can be associated with the
partitioning process. Such an MLT defines the flow of control of the various stages
of intermediate forward elimination, backsubstitution and assembly. Overall the MLT
consists of the top and intermediate branches and the root. Element generation
occurs at the top branches. The intermediate branches and root are a result of the
successive stages of the forward elimination process.

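At a typical top level branch, the partitioned version of (15) takes the form

$$[{}^{p}_{\Lambda}K_D({}^{p}_{\Lambda}Y^j_{i-1})]\;{}^{p}_{\Lambda}\Delta Y^j_i = {}^{p}_{\Lambda}R^j_D - \int_{{}^{p}_{\Lambda}V} [B^*]^T S\,dv\,\Big|_{{}^{p}_{\Lambda}Y^j_{i-1}} \tag{39}$$

where the various pre-post super-subscripts denote
p - partition number
$\Lambda$ - top level branch
j - increment number
i - iteration count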

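Introducing separate partition level constraints, (39) yields the following
expression, namely

$$[{}^{p}_{\Lambda}K_D({}^{p}_{\Lambda}Y^j_{i-1})]\;{}^{p}_{\Lambda}\Delta Y^j_i = {}^{p}_{\Lambda}R^{j-1}_D + {}^{p}_{\Lambda}\lambda^j_i\;{}^{p}_{\Lambda}\Delta R^j_D - \int_{{}^{p}_{\Lambda}V} [B^*]^T S\,dv\,\Big|_{{}^{p}_{\Lambda}Y^j_{i-1}} \tag{40}$$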


such that here 


> R i> = 


(iRj), O 


0 


( 41 ) 


II 




( 42 ) 


where

${}^{p}_{\Lambda}\lambda^j_i$ - $p$th partition constraint of the $\Lambda$th top level branch

${}_{E}\lambda^j_i$ - common root level

Note the parameter ${}_{E}\lambda^j_i$ is used to provide a global constraint on the solution.
Overall it controls the I/O among the various participating substructures. Its
solution is obtained once the root equations are assembled. The local parameters
${}^{p}_{\Lambda}\lambda^j_i$ control the flow of partition level I/O into the system. In this
context they serve a dual role, namely:

1) To control the flow of ${}^{p}_{\Lambda}(\Delta R^j_D)$, as well as,

2) To provide a bound on the ${}^{p}_{\Lambda}\Delta Y^j_i$ generated during successive partition
level iterations. Their values are established via a local partition/substructure
level solution.

Overall ${}_{E}\lambda^j_i$ and ${}^{p}_{\Lambda}\lambda^j_i$ are obtained in a three pass operation. This is
described below.

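1. First Pass. Initially the local and global constraints are set equal to each
other, i.e.

$${}^{p}_{\Lambda}\lambda^j_i = {}_{E}\lambda^j_i \tag{43}$$

for $\forall$ p and $\Lambda$. Hence (40) reduces to the form

$$[{}^{p}_{\Lambda}K_D({}^{p}_{\Lambda}Y^j_{i-1})]\;{}^{p}_{\Lambda}\Delta Y^j_i = {}^{p}_{\Lambda}R^{j-1}_D + {}_{E}\lambda^j_i\;{}^{p}_{\Lambda}\Delta R^j_D - \int_{{}^{p}_{\Lambda}V} [B^*]^T S\,dv\,\Big|_{{}^{p}_{\Lambda}Y^j_{i-1}} \tag{44}$$

Next (44) are solved in a substructural sense, eliminating all local top branch
internal variables, i.e.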


12 


P a<* y A 


This is achieved by restructuring the local tangent stiffness matrix and incremental 
displacement vector into the following form namely 


= 


'AY' = 

A i 


i k d 

1 


1 

e k d_ 

(45) 

f (AY >/ 

V 


l (ay > £ 

), 

(46) 


In terms of (45) and (46), (44) can be rearranged to obtain an expression for the 
external incremental displacement of each (p,A) top branch, in particular: 


where 


P Md"1> = eM <^ r d >£ + 

r E = W\-(J Pv ,IB*fS^ ) - 

A i-1 

- RX (V !« , / K D (Y i ir ‘ > ( L v i B 'i Tsdv L -), 


(47) 


(48) 


A ‘i-1 

Assembling (48) at the next branch level, (A-l), we obtain the following intermediate 
expression 


= A-W + A-? 1 < 49 > 

such that the (A-l) presubscript denotes the assembled coefficients and dependent 
variables. Restructuring into the (A-l)th level externals and internals yields the 
relation 
Repeating the process recursively for each of the requisite branch levels, we obtain
the following algorithmic relationships namely 


= W ( A-W + A-? - 


(51) 


A-?* K ’p (Y '»<A-? &Y i>S = + A-?* (52) 

where € c [0, . . . , A-l]. Once the root level is reached, we can solve for E \i and j p AYj j . 
This is achieved by simultaneously satisfying the algorithms 


x AYf = ^(Ypr 1 {X* (jARjr,) + t r} (53) 

and 

jPljYf f + (X^R^II) 2 = BjR^f (54) 

where p is chosen from the inequality 




Max 

v(p.e> 


liL^n 2 

C-e R il 2 


(55) 


2. Second Pass. The results obtained in the first pass are backsubstituted up the
MLT to yield the intermediate and top level dependent fields. To obtain the
corrected top level results, local iterations are necessary. The requisite algorithms
are given by the expressions


= aM W , + A r ; (56> 

and 

. • ' / K Y i^ ( AW = W 2 (57) 

where the local (p, A) scaling parameters are defined by the inequalities



(58) 


P., _ Max 
A h “ V(p, A,*) 


AY 


such that $\Delta Y_a$ defines the allowable individual degree of freedom excursion. The
iteration process is continued until all top branch formulations are converged.

3. Third Pass . Once the second pass iteration process is complete, recursive passes 
up and down the tree can be employed to yield global convergence for the given load/ 
time step. All this is illustrated in Fig. 3.2. 
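The core operation behind the first and second passes, eliminating a partition's internal variables and later recovering them from the externals, can be sketched for a single linear partition. The listing assumes a small dense stiffness and a fixed load vector, leaves out the constraint parameters and the nonlinear iteration, and uses hypothetical function names throughout.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def condense(k, f, internal, external):
    """Forward pass: eliminate internal dofs, return reduced (K_E, F_E)."""
    k_ii = [[k[i][j] for j in internal] for i in internal]
    # Columns of K_II^-1 K_IE and the vector K_II^-1 F_I.
    cols = [solve(k_ii, [k[i][e] for i in internal]) for e in external]
    y_f = solve(k_ii, [f[i] for i in internal])
    k_red = [[k[r][e] - sum(k[r][i] * cols[ce][ci]
                            for ci, i in enumerate(internal))
              for ce, e in enumerate(external)] for r in external]
    f_red = [f[r] - sum(k[r][i] * y_f[ci] for ci, i in enumerate(internal))
             for r in external]
    return k_red, f_red


def recover_internal(k, f, internal, external, y_ext):
    """Backward pass: internal dofs from the known external dofs."""
    k_ii = [[k[i][j] for j in internal] for i in internal]
    rhs = [f[i] - sum(k[i][e] * y_ext[ce] for ce, e in enumerate(external))
           for i in internal]
    return solve(k_ii, rhs)


K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
F = [1.0, 2.0, 3.0]
k_red, f_red = condense(K, F, internal=[1], external=[0, 2])
y_ext = solve(k_red, f_red)                      # "root" solve on externals
y_int = recover_internal(K, F, [1], [0, 2], y_ext)
y_direct = solve(K, F)                           # reference, no partitioning
```

In a multilevel tree the reduced pairs of neighbouring partitions would be assembled and condensed again at the next level, with the recovery applied in reverse order.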

The foregoing multipass algorithms can be performed either sequentially or in
a partially to fully parallel format. Recalling Fig. 2.3, the sequential multilevel
adaptation possesses several branch levels, i.e. Fig. 3.3. When the processor is
stationed at a particular substructure, it must perform several functions. These
include:

1) Update local stiffness - for top level only; 

2) Forward elimination - forward pass; 

3) Back substitution - backward pass; 

4) Matrix assembly - forward pass; 

5) Newton Raphson iteration -top and root levels. 

For multilevel subparallel applications, the flow of control is given in Fig. 3.4. As
with the sequential case, the individual processors are also reassigned to the tasks
denoted by 1)-5) above. This can be achieved in either a single or multiple processor
assignment process.

Note the partition leap frogging described in the flow diagrams given in Figs.
3.3-3.5 is a result of the subparallel nature of the setup, i.e. P is less than the
number of top branch partitions. Note, the leap frogging occurs both on a given
branch as well as between succeeding levels. As the solution process moves down the
tree, the number of partitions on succeeding levels reduces. In this context, while
the upper levels may be subparallel, the lower ones, particularly the root, will be
superparallel. For an MLT wherein the work load of each level is uniformly
distributed among the associated partitions, the scheduling problem is fairly well
defined. In particular, as the process moves from sub to iso to super parallel
levels, multiple processor assignment occurs. In such situations, scheduling
difficulties are somewhat mitigated by the fact that work load uniformity regulates
the problem. For problems involving localized material or boundary induced
nonlinearity, the updating and iterative convergence will create complications for
those partitions where multiple assignment is scheduled. Note such scheduling
problems may be handled directly by the system compiler-operating system, which
directs the reassignment of processors to the ongoing tasks. Figure 3.6 illustrates
the reassignment process associated with superparallel situations. Note once the
saturation level $P_s$ is reached, no further reassignment should be made at that
partition.

To close the discussion on the solver algorithm, we must address the issues of 
updating and convergence checks. Concerning updating, as noted earlier, three 
forms are possible. For highly nonlinear zones, continuous updating is typically 
necessary, i.e. for regions involving contact-impact, complex media (hyperelastic, 
inelastic, . . ), large strains/rotations, and complex boundary interactions (follower 
forces . .). In regions which are primarily linear elastic but undergoing moderate 
rotations, the BFGS scheme can be employed to effect a quasi update. Traditionally 
the method has been employed in a global context. Here it can be employed at the 
local partition level. 

In tree applications, several considerations are possible. These include:

1) Top level branches where direct element updating occurs;

2) Intermediate branch levels which are linked directly to top partitions
undergoing updating;

3) Intermediate branch levels which are connected to a mix of updated and
nonupdated top partitions and

4) Top and intermediate levels not undergoing updating due to association
with essentially modestly nonlinear substructural zones.

The foregoing points to two issues, namely

1) when to update and,

2) how to update - full or quasi (BFGS).

Several investigators have considered the problem of when to update [10,14,16].
For instance in the work of Sheu et al. [16], two criteria were employed, namely

i) Incremental ratio tests of the normed out of balance loads or deflections
and the appropriate reference states and

ii) Evaluation of successive variations in incremental energy states.

In fact a wide variety of possible flags can be established. Depending on problem
type, these can be categorized into several basic tests

1) Measures of large rotation/small strain

2) Large strain/volumetric/shape distortion

3) Material nonlinearity and

4) Boundary induced effects.

Overall, the tests can be grouped into two classes of solution checks, i.e.:

1) Direct - performed at the top level and;

2) Indirect - performed primarily at intermediate and root levels.

The direct monitoring involves the evaluation of the conditioning of the localized
dependent field variables. This includes both kinematic and stress measures,
namely [10]

• Invariants of the deformation gradient, Green-Lagrange measure, Finger
tensor, Cauchy stress, etc.

• Rotations associated with designated target nodes

• Dilatation

• Element eigenvalues and so on.

Note since the stress and kinematic states define natural response, all system
characteristics are covered. To assess conditioning, incremental variations can be
ratio tested. For instance, if we consider the Green-Lagrange invariants $I_1$, $I_2$,
$I_3$, the element level test would have the form


Tol> 


wh 


kell ,3) 


(59) 


such that e designates the element number and $\overline{\Delta I}_k$ the averaged $k$th
invariant. Similar tests could be run on the other field variables. Overall such a
test would be run for the top level branches only. This would gauge the need to
perform element level updating.

The indirect tests are performed to evaluate the state of the intermediate 
branches and root. Since a typical intermediate branch may be composed of various 
updated and nonupdated top level partitions, its update is contingent on the degree 
to which local effects penetrate to lower levels of the tree. Since it is less meaningful 
to define the kinematic and stress fields at such tree levels, we resort to gross/norm 
states. This can be achieved for both the global and intermediate branches. The 
tests consist of ratios of incremental variations in out-of-balance load, deflection, 
energy, inelastic growth, gross partition rotation and volume/area change. At the 
intermediate partition level we have 

i) Out-of-balance load 


l£AFj|| 

l£Afl£|| (60) 
ii) Deflection


iii) Energy 


iv) Inelastic growth 


v) Gross rotation 


Tol> 


I^AY'll 

IjAYj 


Tol > — : 

K 


Tol > 


p A J . 

t pi 

paJ 

pi 




( 61 ) 


( 62 ) 


( 63 ) 


Tol > 


p Aco J 

l l 


P M 


where || || defines the Euclidean norm, and 


*AF' = *R J _-[ [B*] T Sdv\ . 

{ * A D Jp v l * Y J 

A i-1 


( 64 ) 


( 65 ) 


' B i= -[B*f S| )* 

A i-1 


(66) 


For the gross problem and root levels, two tests are possible namely 

1) Those restricted to strictly the root variables or 

2) Those involving all or various of the branch level externals as one tele- 
scopes from the root to the top of the tree. 

Such tests are exemplified by the following expressions: 

1. Root only: 
• Out of balance loads


• Deflection 


• Energy 


Tol > 



Tol > 


1,^1 


,E J . 

Tol > 

i E i 


2. Telescoping (all levels) 
• Out of balance loads 


• Deflection 


• Energy 


1 1 « 

r»i> — 

IIM 

* P 


II I^AYf'|| 

Tol > -j — 2 

II Ml 

f p 


( 67 ) 


(68) 


(69) 


(70) 


(71) 
( 72 )


I I# 

T °l> T 

II K 
{ p 

Contingent on modelling needs, one or more of the foregoing tests can be 
implemented. 
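A minimal sketch of how such ratio tests might drive the choice of update is given below. It assumes each test compares the Euclidean norm of an incremental quantity with that of a reference state, and that a partition whose ratios all stay below the tolerance is treated as mildly nonlinear and given a quasi (BFGS) update. The function names and the pairing of increments with references are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a sequence."""
    return math.sqrt(sum(x * x for x in v))


def ratio_test(increment, reference, tol):
    """True when ||increment|| / ||reference|| stays below tol."""
    ref = norm(reference)
    if ref == 0.0:
        return False                      # no reference state: be conservative
    return norm(increment) / ref < tol


def choose_update(checks, tol):
    """Select the update format for one partition.

    checks : list of (increment, reference) vector pairs, e.g. out-of-balance
             load and deflection measures for the partition
    Returns "bfgs" when every ratio is below tol, otherwise "full".
    """
    passed = all(ratio_test(inc, ref, tol) for inc, ref in checks)
    return "bfgs" if passed else "full"


mild = choose_update([([0.01, 0.0], [1.0, 0.0])], tol=0.05)
severe = choose_update([([0.01, 0.0], [1.0, 0.0]),
                        ([0.5, 0.0], [1.0, 0.0])], tol=0.05)
```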

If the foregoing tests remain below a specified tolerance limit, then a modified 
version of the BFGS scheme can be employed to update local partition level stiffness. 
This enables reasonable iterative convergence while bypassing the need for an 
inverse at the given substructure. Adapting the scheme to a partition level 
application, it follows that the updated local stiffness takes the form 

A-iKvfa = w/ *-?!*;> fffn, = w,! < 73) 

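where

$$[\Phi_i]^T = [I] + V_i\,(W_i)^T \tag{74}$$

The vectors $V_i$ and $W_i$ are calculated from the known nodal point forces and
displacements using the relations

$$V_i = -\gamma_i - \left( \frac{(\delta_i)^T \gamma_i}{(\delta_i)^T [K_D]_{i-1}\,\delta_i} \right)^{1/2} [K_D]_{i-1}\,\delta_i \tag{75}$$

and

$$W_i = \frac{1}{(\delta_i)^T \gamma_i}\,\delta_i \tag{76}$$

where $[K_D]_{i-1}$ is the partition level stiffness of the previous iterate and

$$\delta_i = Y_i - Y_{i-1} \tag{77}$$

$$\gamma_i = F_i - F_{i-1} \tag{78}$$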

Due to the generality of the formulation described by (73)-(78), BFGS type
updating can be applied to any level in the MLT and any partition of that level.
Note that near bifurcations and turning points, e.g. buckling, the full update
should be employed to yield the proper transitional characteristics. This also
holds true in inelastic/history-dependent processes wherein the proper updating is
needed to trace the plastic event. As the calculations proceed down the MLT, it is
possible that an isolated local zone can be handled via BFGS updating at
intermediate levels. This would greatly reduce the work load.
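A product form quasi update of this kind can be sketched as follows. The listing assumes the standard textbook convention, in which the updated inverse is $(I + w v^T)\,K_{i-1}^{-1}\,(I + v w^T)$ with $v = -c\,K_{i-1}\delta - \gamma$, $w = \delta/(\delta^T\gamma)$ and $c = (\delta^T\gamma / \delta^T K_{i-1}\delta)^{1/2}$; the transposition convention may differ from that used in the relations above. The previous inverse is passed in explicitly only to keep the example short, and all names are hypothetical.

```python
import math


def matvec(m, v):
    """Dense matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(a, b):
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def bfgs_inverse_apply(k_prev, k_prev_inv, delta, gamma, rhs):
    """Apply (I + w v^T) K_prev^-1 (I + v w^T) to rhs.

    delta : displacement increment between successive iterates
    gamma : corresponding force increment; requires delta.gamma > 0
    No new matrix is formed or factored; only vector operations and one
    application of the previous inverse are needed.
    """
    k_delta = matvec(k_prev, delta)
    dg = dot(delta, gamma)
    c = math.sqrt(dg / dot(delta, k_delta))
    v = [-c * kd - g for kd, g in zip(k_delta, gamma)]
    w = [d / dg for d in delta]
    t = [r + vi * dot(w, rhs) for r, vi in zip(rhs, v)]   # (I + v w^T) rhs
    t = matvec(k_prev_inv, t)                             # K_prev^-1 (...)
    return [ti + wi * dot(v, t) for ti, wi in zip(t, w)]  # (I + w v^T) (...)


K_PREV = [[2.0, 0.0], [0.0, 2.0]]
K_PREV_INV = [[0.5, 0.0], [0.0, 0.5]]
K_TRUE = [[4.0, 1.0], [1.0, 3.0]]          # stiffness actually governing F(Y)
delta = [0.1, -0.2]
gamma = matvec(K_TRUE, delta)              # force change seen over the step
step = bfgs_inverse_apply(K_PREV, K_PREV_INV, delta, gamma, gamma)
```

The updated inverse reproduces the observed pair exactly, i.e. applying it to the force increment returns the displacement increment (the secant condition), which is what allows reasonable convergence without a new factorization.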

IV. Automated Substructuring 

Overall to automate substructuring four basic steps are required. These 
include: 

i) Symmetry checks on both geometric and nodal topology 

ii) Establish multilevel partitioning 

iii) Bandwidth minimize each individual substructural component and 

iv) Optimize memory and work load by selecting best number of levels and 
decomposition per level. 

Oftentimes structures have localized symmetries which, due to loading and
boundary conditions, cannot be taken advantage of in a global sense. Such
symmetries can occur at the root, intermediate or top levels. To determine such
attributes, an automated symmetry check must be used to help optimize the
substructuring. While visually such properties are easily spotted, from a
numerical-analytical point of view, such is not the case. Here we adopt the
following multistep procedure, namely

i) Find CG

ii) Determine inertia tensor (I) relative to CG

iii) Find principal coordinates and components of (I), i.e. ($I_p$)

iv) Establish nature of symmetry based on

• properties of ($I_p$)

• coordinate check about principal axes

v) Define nature of partitioning based on symmetry type and

vi) Perform such checks recursively at succeeding levels.

From an FE point of view, the determination of the global/substructural CG
can be established by employing the element properties. Specifically, at the $\ell$th
level and $p$th partition, the associated elements define the following expression


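$$\left( {}^{p}_{\ell}X_1,\; {}^{p}_{\ell}X_2,\; {}^{p}_{\ell}X_3 \right)_{CG} = \frac{1}{{}^{p}_{\ell}V} \sum_e \int_{{}^{p}_{\ell}V_e} (X_1, X_2, X_3)\,dv \tag{79}$$

where

$${}^{p}_{\ell}V = \sum_e {}^{p}_{\ell}V_e \tag{80}$$

such that

${}^{p}_{\ell}V_e$ - element volume
${}^{p}_{\ell}V$ - net volume of (p, $\ell$) pair

Based on the CG location, the inertia tensor takes the form

$$[{}^{p}_{\ell}I_{CG}] = \sum_e \int_{{}^{p}_{\ell}V_e} [i_{CG}]\,dv \tag{81}$$

such that $\int (\;)\,dv$ defines typical Gauss quadrature and, with $x_i$ measured
relative to the CG,

$$[i_{CG}] = \begin{bmatrix} x_2^2 + x_3^2 & -x_1 x_2 & -x_1 x_3 \\ -x_2 x_1 & x_1^2 + x_3^2 & -x_2 x_3 \\ -x_3 x_1 & -x_3 x_2 & x_1^2 + x_2^2 \end{bmatrix} \tag{82}$$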
Since $[I_{CG}]$ is a 2nd order tensor, its principal orientation and properties can
be established in the usual manner. This yields

$$[I_p] = \mathrm{diag}(I_{p1}, I_{p2}, I_{p3}) \tag{83}$$

As noted earlier, depending on the makeup of $[I_p]$, various symmetries can be
identified, in particular

i) If two components are repeated, there is planar symmetry

ii) If three are repeated, there is 3-D symmetry.
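The first steps of the symmetry check can be sketched in 2-D as follows. The listing assumes each element is represented by its centroid and volume (a one point quadrature, which is a simplification of the Gauss quadrature used for the element integrals) and flags repeated principal values to within a relative tolerance. All names are hypothetical.

```python
import math


def centroid(cells):
    """CG of a partition; cells is a list of (x, y, volume) per element."""
    vol = sum(v for _, _, v in cells)
    return (sum(x * v for x, _, v in cells) / vol,
            sum(y * v for _, y, v in cells) / vol)


def principal_inertias(cells):
    """Principal values of the 2-D inertia tensor taken about the CG."""
    cx, cy = centroid(cells)
    ixx = sum(v * (y - cy) ** 2 for _, y, v in cells)
    iyy = sum(v * (x - cx) ** 2 for x, _, v in cells)
    ixy = -sum(v * (x - cx) * (y - cy) for x, y, v in cells)
    mean = 0.5 * (ixx + iyy)
    radius = math.hypot(0.5 * (ixx - iyy), ixy)   # eigenvalues of a 2x2
    return mean - radius, mean + radius


def has_repeated_principals(cells, tol=1e-9):
    """True when the two principal values coincide to within tol."""
    low, high = principal_inertias(cells)
    return abs(high - low) <= tol * max(abs(high), 1.0)


square = [(0.5, 0.5, 1.0), (-0.5, 0.5, 1.0), (0.5, -0.5, 1.0), (-0.5, -0.5, 1.0)]
strip = [(-1.5, 0.0, 1.0), (-0.5, 0.0, 1.0), (0.5, 0.0, 1.0), (1.5, 0.0, 1.0)]
```

The four element square yields repeated principal values, whereas the four element strip does not and would therefore be passed on to the coordinate check about its principal axes.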

If no repeated roots exist, then potential symmetries about each principal axis can
be ascertained via a sum check. Specifically, for nodes symmetrically placed relative
to the CG, the following identities hold


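$$\begin{aligned} \sum_i X_{1i} &< \mathrm{Tol.} \quad \text{(1 axis symmetry)} \\ \sum_i X_{2i} &< \mathrm{Tol.} \quad \text{(2 axis symmetry)} \\ \sum_i X_{3i} &< \mathrm{Tol.} \quad \text{(3 axis symmetry)} \end{aligned} \tag{84}$$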

where the tolerance depends on user expectations. 

In the case that repeated roots are found, then the foregoing sum tests are 
automatically satisfied about any Cartesian coordinate system, i.e. arbitrary planar 
orientation for two repeated roots and arbitrary 3-D orientation for three roots. For 
the case of I_p1 = I_p2 (two roots), to find the symmetry axes we search for the max
points, i.e. outer bounds of the object. These will lie in a skew symmetric format. In
this context we search for

$$ X_{1M} = \max_i \{ X_{1i} \}, \quad i \in [1, {}^{p}_{\ell}N] \tag{85} $$

where M defines the outlier node, i.e. (X_1M, X_2M, X_3M). Once obtained, the
appropriate transformations can be established. For 2-D/3-D situations these are
represented by the relations
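i) 2-D:

$$ \alpha = \tan^{-1}(X_{2M}/X_{1M}) \tag{86} $$

$$ \begin{Bmatrix} X_1^* \\ X_2^* \end{Bmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \begin{Bmatrix} X_1 \\ X_2 \end{Bmatrix} \tag{87} $$

ii) 3-D:

$$ \alpha_1 = \tan^{-1}(X_{2M}/X_{1M}) \tag{88} $$

$$ \alpha_2 = \tan^{-1}(X_{3M}/X_{1M}) \tag{89} $$
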
such that ( )* denotes the symmetry axis system. Note for repeated principal 
inertias, a multitude of such axis systems are possible. 

Once the nature of the available symmetry is established, the type of partitioning
must be chosen. In the context of the HPT recently developed by Padovan
et al. [18,19], it follows that optimal results require the appropriate substructural
arrangement. For instance, considering the 2-D case, there are either 1 or 2 axes of
symmetry. These respectively require k_1 and k_1 k_2 partitioning such that the k_i are
even. In this way, in addition to yielding the proper choice of external-internal load
balancing, maximum use can be made of cyclic substructural generation for linear
elements. Such cloning can yield at least a two and four fold speedup for
1 and 2 axes of symmetry respectively. This can be achieved at each level with
symmetric partitions. Note such enhancements are in addition to the benefits of the
MLT. For the 3-D case, k_1, k_1 k_2 and k_1 k_2 k_3 partitions must be established for
1, 2 and 3 axis symmetric states. Again, for linear substructures, such symmetries
enable at least a 2, 4 or 8 fold speedup via cloning at each possible level.

Note that beyond defining the cloning and substructuring characterization,
symmetry properties can be employed to define multiple seeding points from which
to initiate simultaneous partitioning. Two procedures are possible, i.e.

i) The consecutive node attachment (CNA) scheme of Wilson and Farhat [17]
and

ii) The direct element connect filling (DECF) scheme

In the Wilson-Farhat scheme, substructuring is automated into a series of 
recursive steps involving: 

1) Bandwidth minimization 

2) Connection of elements attached to nodes of ascending order 

3) Partition definition completed when appropriate number of attachments 
achieved 

4) Procedure repeated with already defined substructure subtracted from
process: at this point bandwidth minimization is reapplied to the reduced
model.

Due to the recursive use of bandwidth minimization, the current procedure is 
limited to starting from the initial node number. This follows from the fact that due 
to the goal of reducing the skyline height, it is possible to have noncontiguous node 
numbering. Hence, potentially nonconnected, i.e. discontinuous partitions may 
develop during the agglomeration process. This is especially true if starting points 
further up the node count were attempted. 
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To bypass the foregoing difficulty, a DECF scheme will be employed. Here the 
steps include 

1) Establish multiple starting points; Fig. 4.1 

2) Using element connectivity map, adjoin elements directly connected to 
starter nodes, Fig. 4.2; proceed simultaneously at each initiation site 

3) Continue process in successive waves of attachment; Figs. 4.3 and 4.4 

4) In symmetrical problems, attachments are performed in balanced pairs;
Fig. 4.2

5) Fill process continued until requisite number of elements adjoined - con- 
tingent on number of partitions defined for level. 
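
The fill steps above can be sketched as a multi-seed, wave by wave growth over the element connectivity map. The sketch below is an illustration under stated assumptions rather than the implementation used here: two elements are taken as "directly connected" when they share at least one node, every partition receives the same element quota, ties are broken by element number, and the names `element_adjacency`, `decf_partition` and `grid_elements` are hypothetical. Elements left over once all quotas are filled are not handled.

```python
from collections import defaultdict

def element_adjacency(elements):
    """Map each element id to the set of elements sharing at least one node."""
    node_to_elems = defaultdict(set)
    for eid, nodes in elements.items():
        for n in nodes:
            node_to_elems[n].add(eid)
    adj = {eid: set() for eid in elements}
    for elems in node_to_elems.values():
        for e in elems:
            adj[e] |= elems - {e}
    return adj

def decf_partition(elements, seed_nodes, quota):
    """Grow one partition per seed node in simultaneous waves of attachment."""
    adj = element_adjacency(elements)
    owner, fronts = {}, []
    for p, seed in enumerate(seed_nodes):
        # First wave: elements directly attached to the starter node.
        front = [e for e in sorted(elements)
                 if seed in elements[e] and e not in owner][:quota]
        for e in front:
            owner[e] = p
        fronts.append(front)
    counts = [len(f) for f in fronts]
    while any(fronts):
        for p in range(len(seed_nodes)):     # one wave at every initiation site
            new_front = []
            for e in fronts[p]:
                for nb in sorted(adj[e]):
                    if nb not in owner and counts[p] < quota:
                        owner[nb] = p
                        counts[p] += 1
                        new_front.append(nb)
            fronts[p] = new_front
    return owner

def grid_elements(n):
    """n x n four node quads on an (n+1) x (n+1) node grid."""
    def node(i, j):
        return j * (n + 1) + i
    return {j * n + i: (node(i, j), node(i + 1, j), node(i + 1, j + 1), node(i, j + 1))
            for j in range(n) for i in range(n)}

# 4 x 4 mesh with two axes of symmetry: seed at the four corner nodes.
elements = grid_elements(4)
owner = decf_partition(elements, seed_nodes=[0, 4, 20, 24], quota=4)
print(owner)
```

For this doubly symmetric square mesh the four corner seeds grow into the four quadrants, which is the balanced arrangement the symmetry arguments above call for.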

The starting points are chosen via criteria similar to those of traditional bandwidth
minimizers (Cuthill-McKee [24], Gibbs-Poole-Stockmeyer [25]), i.e.:

1) Determine points with minimum connectivity - outside corners, kinks, 
edges; 

2) Due to symmetry, several starting points may be employed simultaneously 

• 2-D/2 axis - 4 points/1 axis - 2 points 

• 3-D/3 axis - 8 points/2 axis - 4 points 

3) Beyond providing seeding points, symmetry can be used to define constraint
surfaces which control the growth of partitions during the agglomeration
process, in particular

i. 2-D problems, Fig. 4.5

• 1 axis - single axis dissection

• 2 axis - two axis 

ii. 3-D problems, Fig. 4.5 

• 1 axis/prismatic - single planar dissection 

• 2 axis/prismatic - two plane 

• 3 axis - three plane 
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As noted earlier, the partitioning process can be reapplied at each succeeding 
level. In such applications global-root level symmetries may give way to symmetric 
or asymmetric intermediate levels, Fig. 4.6. If multiple levels of symmetry are 
noted, Fig. 4.6, cloning where possible can lead to significant speedup. 

Once each of the various substructures is defined, the associated external-internal
node number count must be minimized. The traditional bandwidth minimizers [26]
are formulated to minimize the skyline height associated with individual
nodes. In this context the nodes can be shifted to any location on the diagonal. For
substructural problems, the external nodes must be positioned at the base or top of
the diagonal. This can cause a suboptimal arrangement between the externals and
internals. Noting Fig. 4.7, minimizing strictly the internals yields a skyline
structure with large coupling side bands. For instance, considering the square region
discussed earlier, the overall computational effort for a typical column elimination
operation is given by the expression, Fig. 4.8

$$ \text{Single Column Effort} \sim E_1 + E_{2/3} + E_4 \tag{91} $$

such that

E_1 - Effort in purely internal block
E_2/3 - Side band effort
E_4 - External block

When averaged over all rows, we obtain

$$ \text{Net Effort} \sim E_1 + E_{2/3} + E_4 \tag{92} $$

where 

Ej~^(N-2) 2 N 4 

E 2/3 ~4N(N-D (N- 2) 2 


(93) 

(94) 
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(95) 


E 4 ~4(N-\f (N- 2) 2 

Asymptotically, as N → ∞ (large), it follows that

$$ \text{Net Effort} \sim \frac{9}{2} N^4 $$

In this context, substructuring yields a 450% increase in work load over the straight
solution.

The root cause of the increase lies in the fact that the bandwidth minimization 
did not account for the fact that the optimal treatment of the problem involves the 
handling of the multipoint constraint defined by the externals. In view of this, 
instead of starting the search for the minimum skyline profile from single least 
connected internal nodes, the substructured version could alternatively start from 
the externals. Based on this, the operational steps of the minimizer become 

1. Starter points involve all externals 

2. Determine all directly connected internal nodes from connectivity map — 
this generates a "shell” of nodes 

3. Adjoin shell directly to starter nodes 

4. Determine next layer of direct connected internal nodes 

5. Adjoin to previous shell 

6. Continue process in successive shells of attachment, Fig. 4.9. 
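
The six steps above amount to a layer by layer sweep of the node connectivity map that starts from the externals. The following sketch illustrates this under stated assumptions: the connectivity map is a plain dictionary of node neighbours, the ordering within a shell is simply sorted (the actual spiral ordering within a shell is not reproduced), and the names `shell_numbering` and `grid_adjacency` are hypothetical. Whether the externals receive the first or the last numbers depends on the elimination order adopted and is left open here.

```python
def shell_numbering(adjacency, externals):
    """Number nodes shell by shell, starting from the external nodes."""
    shells = [sorted(externals)]
    seen = set(externals)
    while True:
        # Next shell: unnumbered nodes directly connected to the previous shell.
        candidates = {nb for n in shells[-1] for nb in adjacency[n]} - seen
        if not candidates:
            break
        shells.append(sorted(candidates))
        seen |= candidates
    numbering = {}
    for shell in shells:
        for node in shell:
            numbering[node] = len(numbering)
    return shells, numbering

def grid_adjacency(n):
    """Four-neighbour connectivity of an n x n node grid (illustrative only)."""
    adj = {}
    for i in range(n):
        for j in range(n):
            adj[(i, j)] = [(i + di, j + dj)
                           for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= i + di < n and 0 <= j + dj < n]
    return adj

N = 6
adjacency = grid_adjacency(N)
externals = [node for node in adjacency if 0 in node or N - 1 in node]
shells, numbering = shell_numbering(adjacency, externals)
shell_sizes = [len(s) for s in shells]
print(shell_sizes)
```

For the N = 6 square patch the shell populations follow the 4N-4-8(ℓ-1) pattern of the square patch example.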

Overall the procedure generates an inwardly spiralling numbering pattern. To
illustrate the improved computational efficiency of such a skyline, we reconsider the
square patch. Noting Fig. 4.10, the spiralling count causes the succeeding attachment
shells to have reduced bandwidths, i.e. going from 4N-4 for the external first shell to
4N-4-8(ℓ-1) for the ℓth internal shell. The population of nodes associated with such
skyline heights reduces as one moves inward into the substructure, namely from
(4N-4) in the first shell to 4N-4-8(ℓ-1) for the ℓth shell. This is illustrated in Fig. 4.11.
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To determine the work effort associated with the foregoing numbering scheme, 
it is noted that unlike the traditional approach, no large coupling side bands are 
encountered. The computational effort can be cast in the form

$$ \text{Net Effort} \sim \frac{1}{2} \sum_{\ell=1}^{N/2} (4N + 4 - 8\ell)^3 - \frac{1}{2}(4N-4)^3 \tag{97} $$

which for arbitrary N yields

$$ \text{Net Effort} \sim 4N^2(N^2 - 2) - \frac{1}{2}(4N-4)^3 \tag{98} $$

From an asymptotic point of view, the net computation effort associated with the 
succeeding attachment shell numbering pattern takes the form 

$$ \text{Net Effort} \sim 4N^4 $$

This represents a potential 12+% improvement over the classical approach, i.e. Eq.
(91). Note such improvements are strongly dependent on geometry, element type
and element density within a given region. Hence, care must be exercised in its use.
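
The step from the shell sum of Eq. (97) to the closed form of Eq. (98) can be checked numerically. The short sketch below does so in exact integer arithmetic for even N; the function names `shell_sum_effort` and `closed_form_effort` are hypothetical, and the upper summation limit N/2 is taken as read from Eq. (97).

```python
def shell_sum_effort(n):
    """Eq. (97) evaluated term by term for even n (exact integer arithmetic)."""
    shells = sum((4 * n + 4 - 8 * l) ** 3 for l in range(1, n // 2 + 1))
    return (shells - (4 * n - 4) ** 3) // 2

def closed_form_effort(n):
    """Eq. (98)."""
    return 4 * n * n * (n * n - 2) - (4 * n - 4) ** 3 // 2

for n in (4, 10, 100):
    print(n, shell_sum_effort(n), closed_form_effort(n))

# Ratio to N^4 approaches the asymptotic coefficient 4 as N grows.
print(closed_form_effort(1000) / 1000 ** 4)
```

The two expressions agree for every even N tried, and the ratio to N^4 tends to 4. Comparing leading terms, (9/2)N^4 against 4N^4 is a factor of 1.125, which is consistent with the 12+% figure quoted above.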

As has been seen by Padovan and Gute [19], the choice of the optimal hierarchy
of levels and partitions is highly problem dependent. In this context, to determine
the best arrangement, an iterative strategy must be implemented. Overall it consists
of the following steps

1) Recursively choose number of levels 

2) Partition each level through various permutations 

3) Estimate work load for bandwidth minimized partitions 

4) Compare work loads of various level/partition arrangements to achieve 
optimal results. 
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V. Discussion - Benchmarking 

In the previous sections, a hierarchically scaled multilevel nonlinear equation 
solver was developed. Overall the procedure provides for the possibility of assigning 
separate constraints to each partition throughout all the levels of the MLT. The 
focus of the benchmarking of the scheme will be two fold, namely to illustrate 

1) The numerical performance of the nonlinear solver and 

2) The memory-computational savings afforded by the Tree scheme. 

Figure 5.1 illustrates the arch type truss structure used to evaluate the
numerical/iterative performance. The arch was chosen since it possesses highly
nonlinear force deflection attributes. These are illustrated in Fig. 5.2. Depending on
whether compressive or tensile loading is applied, either softening or hardening
behavior is excited. The associated deflected shapes are given in Fig. 5.3. Such
nonlinear characteristics yield varying numerical sensitivities. In this context,
Tables 5.1-5.3 illustrate the effects of initial step size, truss geometry and loading
direction on the numerical convergence. Various types of iterative update schemes
were evaluated. These include


    Root          Second Level

    Full          Full
    BFGS          Full
    BFGS          No update
    No update     Full
    No update     No update
    Automatic     Automatic


For the automatic case, updating and constraint control are triggered by the
appropriate criteria. Here, due to the large deformation/rotation behavior associated
with the truss, a rotation check can be employed. Once the requisite change in
rotation is accumulated, the local partition stiffness is updated. Here both full and
BFGS type schemes were tested.

As can be seen from Tables 5.1 & 5.2, the full and automated update schemes 
yielded essentially the same results. In the case of the automated scheme, signifi- 
cantly less updating was required. For the cases noted, the automatic scheme 
yielded update savings of 30-40%. This was achieved with no changes recorded in 
comparisons with the global scheme. 

When localized constraints were introduced for each partition, convergence was
obtained for all the load ranges considered. Note care must be taken when
employing such a robust approach, especially for problems with multiple solution
states. In such situations, the load readjustment generated by the individual
constraints may cause movement to different loading paths, thereby leading to an
alternative solution state. This can be prevented by tightening up on the admissible
dependent field excursions generated during successive iterations.

For large loads, generally an incremental application is necessary. The arch's
sensitivity to such a loading approach is depicted in Table 5.3. As can be seen, as the
increment is decreased, the iterative requirements become essentially the same for 
all the schemes. Conversely, in the case of the largest increment, only the full and 
automated updating schemes converge. Here the automated scheme requires 37% 
less updating. This is a result of the possibility of intermittent reformation at the 
various partition levels. This of course is highly dependent on the geometry/connec- 
tivity as well as the loading history of the model treated. 

To illustrate the parallel attributes of the nonlinear MLT scheme, we will consider
the large scale truss problem depicted in Fig. 5.4. The model was substructured
into a three level tree consisting of 1, (3)^2 and (9)^2 square partitions at the 1st,
2nd and 3rd levels respectively. The problem was tested on the NASA Lewis Alliant
system with eight available processors. This enabled a partially parallel application
wherein multiple processor assignment was employed. To simplify the programming,
all the processors were assigned to a given substructure. At the top/third level, the
effort includes updating, assembly, condensation and partial assembly for the next
level. Once completed, all the processors are reassigned to the next partition and so
on in a succession of assignment steps. The next lower branch level is then treated,
proceeding in a level by level format down the MLT. Once the root is handled, the
condensation step is complete. Next the process is reversed level by level for the
backsubstitution step. Two forms of testing were considered, i.e. single and multiple
processor assignment. This enabled the evaluation of potential system contention
problems.

For the given problem and style of Fortran programming, 25 edge nodes 
defined the break even point of the chosen MLT. For larger problems, significant 
improvements can be obtained. For instance at 41 and 56 per side, the single 
assignment (purely sequential) scheme yielded speedups of 2. and 3. relative to the 
global scheme. Employing 8 processors, the 41 and 56 edge noded truss yielded 
speedup factors of 11.4 and 15.5 respectively. In contrast, for the 25 edge noded case,
a factor of 6.02 was recorded. One would have expected a (two/three) to one difference
between the 25 and (41/56) noded examples. The variation is a result of the
increasing contention which occurs as problem size increases. The increased number
of memory fetches required by larger partitions causes increased traffic control
problems on the system bus, thereby delaying computations. In this context, for
architectures employing multiple processor assignment, smaller partitions are more
advantageous. This is clearly illustrated in Table 5.4, which describes the contention
problem as a function of size and number of processors; this is of course machine
dependent.

Next we shall consider the problem of how to select the proper number of
partitions and levels to yield optimal results. As an initial demonstration, we shall
consider a square mesh defined by 2-D four node quad elements. Employing either
model fill or the Wilson-Farhat [17] scheme, the region is dissected into 2, 3, .., 9, ..
partitions, i.e. Fig. 5.5. Noting Fig. 5.5, optimal speedup results are obtained for (k)^2
type arrangements, i.e. symmetrical square partitions. Applying the model fill
scheme recursively on a level by level basis, the same (k)^2 type decomposition yields
the best results at all the various branch levels, i.e. top, intermediate and
root. Based on such an approach, Tables 5.5 and 5.6 illustrate the effort and memory
reduction potential of the tree scheme for several sized problems. As can be seen,
reductions of up to two orders of magnitude can be obtained.

In the case of multiply connected regions, the use of the CNA [17] scheme yields
nonoptimal partitions. This follows from the fact that bandwidth minimization often
leads to noncontiguous node numbering. During the agglomeration process used to
define the partitions, nonconnected or disjoint substructures may therefore arise.
Consider the meshes depicted in Figs. 5.6 and 5.7. The CNA [17] procedure tends to
lead to the distended substructural regions depicted in Fig. 5.8. In contrast, the DECF
scheme yields more regular subregions, i.e. Fig. 5.9. Here a variety of seeding points
were employed to start and continue the process.

To reduce the periphery of the various partitions, salient control was applied on
the second pass. Overall the procedure consists of checking boundary attached
elements for the level of connectivity within their partition. Specifically, the
number of adjacencies bordering each edge of a given element is determined. By
employing the connectivity information, the number of elements bordering the
candidate element is determined in each of the directions perpendicular and parallel
to the interconnect boundary. If the noted element has less than a preset number of
adjacencies along one direction, then it is adjoined to the appropriate neighboring
substructure with the requisite connectivity. The results of such a process are
illustrated in Fig. 5.9. Had model symmetry been employed, the results depicted in
Figs. 4.1-4.4 would have been achieved. This of course is highly model dependent.




To establish the improvement potential of the DECF based partitioning
scheme, both the perimeter to internal node ratio and the individual substructural-net
computational effort are determined. These are used to quantify the comparison
with the CNA scheme of Farhat and Wilson [17]. In particular, recalling the models
defined in Figs. 5.6 and 5.7, Figs. 5.10-5.14 illustrate the topological arrangements
for varying levels of automated substructuring, i.e. 4, 9, 16, .. partitions.
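
One way the perimeter to internal node ratio might be evaluated for a given partitioning is sketched below. This is an assumption-laden illustration, not the measure actually coded for the comparison: a node is counted as perimeter (external) when it belongs to elements of more than one partition, nodes on the outer mesh boundary are counted as internal, and the names `perimeter_to_internal_ratio` and `grid_elements` are hypothetical.

```python
from collections import defaultdict

def perimeter_to_internal_ratio(elements, owner):
    """elements: {element id: node ids}; owner: {element id: partition id}."""
    node_parts = defaultdict(set)
    for eid, nodes in elements.items():
        for n in nodes:
            node_parts[n].add(owner[eid])
    external = defaultdict(int)
    internal = defaultdict(int)
    for parts in node_parts.values():
        if len(parts) > 1:               # shared between partitions
            for p in parts:
                external[p] += 1
        else:                            # wholly inside one partition
            internal[next(iter(parts))] += 1
    return {p: external[p] / internal[p]
            for p in set(owner.values()) if internal[p]}

def grid_elements(n):
    """n x n four node quads on an (n+1) x (n+1) node grid."""
    def node(i, j):
        return j * (n + 1) + i
    return {j * n + i: (node(i, j), node(i + 1, j), node(i + 1, j + 1), node(i, j + 1))
            for j in range(n) for i in range(n)}

# 4 x 4 mesh split into four 2 x 2 quadrants.
elements = grid_elements(4)
owner = {eid: 2 * ((eid // 4) // 2) + (eid % 4) // 2 for eid in elements}
ratios = perimeter_to_internal_ratio(elements, owner)
print(ratios)
```

A lower ratio indicates a more compact partition; distended or disjoint partitions of the same element count share more nodes with their neighbours and therefore score higher.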

To provide a consistent basis for comparison, the reverse Cuthill-McKee
scheme [24] is employed to locate the first, i.e. seed, node on the diagonal. The
procedure is reapplied to each succeeding reduced model generated by subtracting
previously defined substructures from the original formulation. Overall the procedure
led to the effort comparisons described in Table 5.7. As can be seen, the DECF
method consistently outperformed the CNF. Significant factors of improvement
were noted over a wide range of partition choices.

An unexpected benefit of the Tree scheme arises for problems involving
repeated-multilevel symmetries, i.e. Fig. 5.15. For such structures, the Tree reduces
the burden of assembly and condensation in linear partitions. In particular, if a
problem such as that depicted in Fig. 5.15 remains linear for several steps, the unit
cell illustrated is all that needs to be generated and translated-rotated to yield the
rest of the substructure. Hence the work load at the top level would be essentially
that of a parallelized setup. The same would be true of each lower level. Hence the
condensation speedup and memory reduction would be that associated with a single
assignment parallelized Tree, i.e. a separate processor for each partition. During the
backsubstitution phase, a similar procedure applies. Overall, depending on problem
topology, significant order of magnitude speedup could be achieved in large scale
problems.
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Figure and Table Captions 


Fig. No.   Caption

2.1    Successive iterations of elliptically constrained Newton Raphson Scheme

2.2    2-D square and 3-D cubic regions

2.3    Sequential and partially parallel flow of control: single processor assignment

2.4    Partially parallel with multiprocessor reassignment

3.1    Multilevel tree defining linkages between successive substructural levels

3.2    Flow of control of HPT

3.3    Sequential flow of control of multilevel tree

3.4    Subparallel single assignment processor leap frogging: Multilevel

3.5    Subparallel multiple assignment processor leap frogging: Multilevel

3.6    Super/iso parallel processor reassignment process

4.1    Multiple starting points for 2-D 2 axis of symmetry model

4.2    Balanced multiple initiation point agglomeration

4.3    Successive waves of attachment: 2 axes of symmetry

4.4    Successive waves of element attachment

4.5    Potential lines and planes of symmetry in 2-D and 3-D defining constraint surfaces

4.6    Multiple levels of substructural symmetry

4.7    Typical skyline of substructured component with minimization employed on
       internal variables

4.8    Work load associated with elimination of ith column: Internally bandwidth
       minimized substructured partition

4.9    Successive shell attachment scheme for node numbering substructured components

4.10   Spiralling node connectivity generated by successive shell attachment scheme

4.11   Skyline of spiralling node connectivity model

5.1    Arch truss simulation and two level tree

5.2    Softening and hardening load deflection characteristics of truss: θ = 10°

5.3    Compressive and tensile deflected shapes of truss (Fig. 5.1)

5.4    Large scale truss structure used to test parallel version of nonlinear
       multilevel tree

5.5    Effect of number of partitions on speedup: two level tree: partition selection
       via Wilson-Farhat Scheme [17]: 8 processors

5.6    Multiple connected mesh: several cells: Model 1

5.7    Multiple connected mesh: multiple cells: Model 2

5.8    Mesh partitioning: Wilson-Farhat Scheme [17]

5.9    Model filling mesh partitioning
       A - CCW fill - CCW seeding
       B - CCW fill - CC seeding
       C - Balanced fill and seeding

5.10   Partitioning of Model 1: 4 partitions

5.11   Partitioning of Model 1: 9 partitions

5.12   Partitioning of Model 2: 4 partitions

5.13   Partitioning of Model 2: 9 partitions

5.14   Partitioning of Model 2: 16 partitions

5.15   Multilevel symmetries


Table No.  Caption

5.1    Load step sensitivity of various update schemes: arch 60°; compressive
       loading (tolerance 10^-6/10^-3)

5.2    Load step sensitivity of various update schemes: arch 10° tensile load
       (tolerance 10^-6/10^-3)

5.3    Multiload step sensitivity of various update schemes: arch 60° (No. of
       increments/total iterations)

5.4    Efficiency (%) of Alliant Parallel Processor

5.5    Effects of problem size on speedup and tree morphology: sequential case

5.6    Effects of problem size on speedup and tree morphology: parallel case

5.7    Comparison of DECF and CNF schemes of partitioning
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FIGURE 2 - 3 
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FIGURE 3.2 
Table 5.1

Load (KIPS)    1       2       4       6       8       10

U-U            4/3     5/3     5/3     6/4     6/4     6/4
BFGS-U         6/4     9/4     11/7    16/9    29/15   43/17
N-U            10/5    17/8    *       *               *
BFGS-N         9/5     14/8    *       *       *       *
N-N            12/5    34/15   *       *       *       *
Auto           4/3     5/3     5/3     6/4     6/4     6/4

*Failed to converge in 50 iterations



Table 5.2

Load (KIPS)    1       2       4       6       8       10

U-U            4/3     5/3     5/3     6/3     6/4     7/4
BFGS-U         7/4     9/6     12/7    16/9    27/19   30/21
N-U            9/4     17/8    *       *       *       *
BFGS-N         13/5    20/10   *       *       *       *
N-N            12/6    33/15   *       *       *       *
Auto           4/3     5/3     5/3     6/3     6/4     7/4

*Failed to converge in 50 iterations
Table 5.3

10 (KIPS)      Number of Load Increments
Target         2       4       6       8       10      20

U              7/17    10/25   13/26   15/30   18/36   32/64
BFGS           *       10/30   13/42   15/50   18/38   32/64
N              *       *       13/61   15/52   18/64   32/64
U-U            7/17    10/25   13/26   15/30   18/36   32/64
N-U            *       10/45   13/42   15/52   18/38   32/64
N-N            *       *       13/61   15/52   18/64   32/64
BFGS-U         *       10/30   13/26   15/30   18/36   32/64
Auto           7/17    10/25   13/26   15/30   18/36   32/64

*Failed to converge in 50 iterations
Table 5.4

           Number of Processors
Size       2        3        4        5        6        7

225        93.66    86.22    80.28    76.58    72.36    67.88
450        95.43    90.22    87.62    79.72    79.34    73.69
500        96.84    92.30    89.24    82.20    81.96    74.93
600        96.65    93.02    87.68    81.38    82.10    76.33
700        95.99    92.05    87.25    81.21    80.81    76.52
800        97.76    92.89    88.20    86.51    81.60    80.68
900        97.12    91.68    89.15    85.65    81.31    78.02
1000       97.93    92.68    88.69    84.46    82.16    78.11
1100       96.55    92.89    86.12    84.18    81.99    78.41
1200       96.49    92.04    87.56    83.38    80.25    76.32
1500       95.69    90.90    85.21    82.54    78.82    74.58
2000       96.15    90.39    84.97    81.68    76.23    71.73
Table 5.5

N       L    S         M        K_i

50      5    5.44      1.99     (illegible)  2  2
100     2    5.88      2.77     (illegible)
100     3    8.78      3.24     2  3
100     4    10.28     3.60     2  2  3
100     5    10.71     3.48     2  2  2  2
100     6    10.71     3.27     2  2  2  2  2
1000    2    26.10     7.45     9
1000    3    60.14     12.57    3  6
1000    4    82.54     16.56    2  3  5
1000    5    95.88     19.65    2  2  3  4
1000    6    101.87    21.31    2  2  2  3  3
1000    7    104.58    21.58    2  2  2  2  2  3
1000    8    105.24    21.41    2  2  2  2  2  2  2
1000    9    105.34    20.88    2  2  2  2  2  2  2  2

N - problem size
L - number of levels
S - speedup
M - memory reduction
K_i - number of partitions per edge at the ith level
Table 5.6 (eight rows, each with N = 1000; remaining entries not recoverable)

N - problem size
L - number of levels
P - number of processors
S - speedup
M - memory reduction
K_i - number of partitions per edge at the ith level
Table 5.7

Number of Partitions     Effort Ratio (DECF/CNF)

Model 1: Fig. 5.6
4                        .398
9                        .25

Model 2: Fig. 5.7
4                        .78
9                        .416
16                       .327